Fusing audio and video information for online speaker diarization

Authors

  • Joerg Schmalenstroeer
  • Martin Kelling
  • Volker Leutnant
  • Reinhold Häb-Umbach
Abstract

In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. Audio and video streams are processed in real time to obtain the diarization information “who speaks when and where” with low latency, for use in advanced video conferencing systems or user-adaptive interfaces. A key feature of the proposed system is that it first gleans information about the speaker’s location and identity from the audio and visual data streams separately and then fuses these data in a probabilistic framework employing the Viterbi algorithm. Here, visual evidence of a person is utilized through a priori state probabilities, while location and speaker change information are employed via time-variant transition probabilities. Experiments show that video information yields a substantial improvement compared to pure audio-based diarization.
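
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic log-domain Viterbi decoder, assuming that each state is one speaker, that audio evidence arrives as per-frame log-likelihoods, that visual evidence is applied as a per-frame log prior over states, and that the transition matrix may differ at every frame. The function name `viterbi_fused` and all array names are hypothetical, and the sketch decodes a whole block offline rather than with the low latency the paper targets.

```python
import numpy as np

def viterbi_fused(log_emis, log_prior, log_trans):
    """Most likely speaker sequence (illustrative sketch, not the paper's code).

    log_emis : (T, S) audio log-likelihood of each speaker state per frame
    log_prior: (T, S) log a priori state probabilities, assumed here to come
               from visual evidence (e.g. a face seen at a location)
    log_trans: (T-1, S, S) time-variant log transition probabilities;
               log_trans[t, i, j] scores moving from state i at frame t
               to state j at frame t + 1
    """
    T, S = log_emis.shape
    back = np.zeros((T, S), dtype=int)       # best predecessor per state
    delta = log_prior[0] + log_emis[0]       # best path score ending in each state

    for t in range(1, T):
        # scores[i, j]: best score of a path ending in i, then stepping to j
        scores = delta[:, None] + log_trans[t - 1]
        back[t] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + log_prior[t] + log_emis[t]

    # Trace the best path backwards from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

With uninformative audio scores, the assumed visual prior alone decides the path, which mirrors the abstract's point that video can resolve cases where audio is ambiguous. Lowering the self-transition entries of `log_trans` at frames where a speaker or location change is detected would make a switch cheaper at exactly those frames.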

Similar resources

Using Weighted Oriented Optical Flow Histograms for Multimodal Speaker Diarization

Speaker diarization currently focuses on using audio features to partition an audio stream into speaker-homogeneous speech regions, in other words to determine “who spoke when”. Recent speaker diarization corpora contain video recordings in addition to the commonly used audio. Thus, we investigated the benefits of incorporating video features, namely histograms of weighted oriented optical flo...

Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation

Our goal is to create speaker models in the audio domain and face models in the video domain from a set of videos in an unsupervised manner. Such models can be used later for speaker identification in the audio domain (answering the question “Who was speaking and when”) and/or for face recognition (“Who was seen and when”) for given videos that contain speaking persons. The proposed system is based on an a...

Speaker diarization de fichiers vidéos hétérogènes issus du web

In the last ten years, the Internet has changed significantly. The main change is certainly in its content: its quantity, its variety, and the media used to present it. Regarding multimedia, the most impressive evolution is the continuously growing success of video sharing websites. But with this success come difficulties in efficiently searching, indexing and accessing relevant information ab...

A Survey on Speaker Diarization Approach for Audio and Video Content Retrieval

Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. Speaker diarization methods can be used to separate the speech and non-speech parts of a recording. Different approaches to speaker diarization can be evaluated. Accordingly, many important imp...

Multimodal speaker diarization using oriented optical flow histograms

Speaker diarization is the task of partitioning an input stream into speaker-homogeneous regions, or in other words, to determine “who spoke when.” While approaches to this problem have traditionally relied entirely on the audio stream, the availability of accompanying video streams in recent diarization corpora has prompted the study of methods based on multimodal audio-visual features. In thi...

Publication date: 2009